Our first dataset for the course comes from NASA (https://data.giss.nasa.gov/gistemp/) and contains global land-surface monthly mean temperature anomalies (relative to the 1951-1980 average) from 1880 to the present.
The research paper associated with these data is: https://pubs.giss.nasa.gov/docs/2010/2010_Hansen_ha00510u.pdf
Reading the file into RStudio. Here, we read from a URL linked to a Google Sheets .csv file containing the temperature data.
my.url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRe9Enb9NeP_Ky_ZMcEyb0xNjOnqWs3GsAsEMv_LhF2wKOt9-CKcTS6dDtYEfC3u2lKMefRWLTT8Ejc/pub?output=csv"
temps = read.csv(url(my.url))
#Looking at the dataset
head(temps)
## Year Jan.1 Feb.1 Mar.1 Apr.1 May.1 Jun.1 Jul.1 Aug.1 Sep.1 Oct.1 Nov.1
## 1 1880 -0.79 -0.37 -0.45 -0.63 -0.33 -0.46 -0.45 0.10 -0.48 -0.65 -0.49
## 2 1881 -0.76 -0.59 -0.34 -0.23 0.01 -1.08 -0.50 -0.21 -0.30 -0.43 -0.51
## 3 1882 0.15 -0.08 -0.05 -0.56 -0.34 -0.99 -0.67 -0.08 -0.05 -0.28 -0.37
## 4 1883 -0.65 -0.91 -0.43 -0.28 -0.32 0.47 0.02 -0.12 -0.42 -0.53 -0.66
## 5 1884 -0.56 -0.30 -0.36 -0.90 -1.15 -0.78 -0.80 0.20 -0.33 -0.73 -0.78
## 6 1885 -0.88 -0.75 -0.87 -0.82 -0.45 -0.89 -0.58 -0.12 -0.12 0.01 -0.34
## Dec.1
## 1 -0.51
## 2 -0.09
## 3 -0.64
## 4 -0.36
## 5 -0.96
## 6 0.05
When working with a dataset, it is important to understand its context:
In this space, add information that you deem important for understanding the context of this dataset.
Loading the required libraries:
library(ggplot2) #graphics package
library(tidyr) #data tidying package
library(dplyr) #data manipulation package
library(lubridate) #date and time handling package
Let’s look at the dataset. Is it a tidy dataset?
head(temps)
## Year Jan.1 Feb.1 Mar.1 Apr.1 May.1 Jun.1 Jul.1 Aug.1 Sep.1 Oct.1 Nov.1
## 1 1880 -0.79 -0.37 -0.45 -0.63 -0.33 -0.46 -0.45 0.10 -0.48 -0.65 -0.49
## 2 1881 -0.76 -0.59 -0.34 -0.23 0.01 -1.08 -0.50 -0.21 -0.30 -0.43 -0.51
## 3 1882 0.15 -0.08 -0.05 -0.56 -0.34 -0.99 -0.67 -0.08 -0.05 -0.28 -0.37
## 4 1883 -0.65 -0.91 -0.43 -0.28 -0.32 0.47 0.02 -0.12 -0.42 -0.53 -0.66
## 5 1884 -0.56 -0.30 -0.36 -0.90 -1.15 -0.78 -0.80 0.20 -0.33 -0.73 -0.78
## 6 1885 -0.88 -0.75 -0.87 -0.82 -0.45 -0.89 -0.58 -0.12 -0.12 0.01 -0.34
## Dec.1
## 1 -0.51
## 2 -0.09
## 3 -0.64
## 4 -0.36
## 5 -0.96
## 6 0.05
Each month column mixes two variables - month and temperature anomaly - so we need to split these up to obtain a "tidy" dataset. To do this, we can use the gather function, since we want to gather multiple columns (the twelve month columns) into two: a column containing the month value and a column containing the temperature anomaly value.
temps2 = gather(temps, key = M, value = Temp_Anomaly, Jan.1:Dec.1)
#Creating tidy data - taking columns from Jan.1 to Dec.1 and then putting column names as the key (in a new column M), and the Temperature anomalies for each month into a new Temp_Anomaly column. This is a tidyr function.
#Looking at the outcome
head(temps2)
## Year M Temp_Anomaly
## 1 1880 Jan.1 -0.79
## 2 1881 Jan.1 -0.76
## 3 1882 Jan.1 0.15
## 4 1883 Jan.1 -0.65
## 5 1884 Jan.1 -0.56
## 6 1885 Jan.1 -0.88
Note: We could recreate our original dataset using the spread command - e.g. spread(temps2, key = M, value = Temp_Anomaly)
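As an aside, in newer versions of tidyr (1.0 and later) gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the equivalent reshaping, using a tiny stand-in dataframe rather than the full temps object:

```r
library(tidyr)

# Tiny stand-in for the temps dataframe (two years, two month columns).
temps_mini <- data.frame(Year  = c(1880, 1881),
                         Jan.1 = c(-0.79, -0.76),
                         Feb.1 = c(-0.37, -0.59))

# Equivalent of gather(): collapse the month columns into key/value columns.
long <- pivot_longer(temps_mini, cols = Jan.1:Feb.1,
                     names_to = "M", values_to = "Temp_Anomaly")

# Equivalent of spread(): widen back to one column per month.
wide <- pivot_wider(long, names_from = M, values_from = Temp_Anomaly)
```

For our data, the same call with cols = Jan.1:Dec.1 would reproduce temps2.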
Now we have a tidy dataframe. However, if we plotted the data right now, we would see that it is plotting all of the months on top of each other!
#filtering the dataset to one year, to visualize lack of monthly resolution
test = filter(temps2, Year == 2000)
#plotting data
ggplot(data = test, aes(x = Year, y = Temp_Anomaly)) + geom_point()
This is because R doesn't know that these points represent different dates within the year 2000. To fix this, we need to convert our dates into a date format:
#turning the Year and Month columns into a column that represents a specific date.
temps3 = temps2 %>% mutate(D = paste(Year, M, sep = ".")) %>% mutate(D2 = as_date(D, "%Y.%b.%d"))
#Looking at our new columns
head(temps3)
## Year M Temp_Anomaly D D2
## 1 1880 Jan.1 -0.79 1880.Jan.1 1880-01-01
## 2 1881 Jan.1 -0.76 1881.Jan.1 1881-01-01
## 3 1882 Jan.1 0.15 1882.Jan.1 1882-01-01
## 4 1883 Jan.1 -0.65 1883.Jan.1 1883-01-01
## 5 1884 Jan.1 -0.56 1884.Jan.1 1884-01-01
## 6 1885 Jan.1 -0.88 1885.Jan.1 1885-01-01
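As a quick sanity check of the format string used above, we can parse a single value by hand (here using the explicit format argument of lubridate's as_date()):

```r
library(lubridate)

# "%Y.%b.%d" maps to: 4-digit year, abbreviated month name, day of month,
# separated by periods - matching strings like "1880.Jan.1".
# Note: %b is locale-dependent; "Jan" assumes an English locale.
d <- as_date("1880.Jan.1", format = "%Y.%b.%d")
d         # 1880-01-01
year(d)   # 1880
month(d)  # 1
```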
#Extracting the month from the new date column and putting it into its own column. This is done to ensure that the month factor is ordered correctly (from January or 01 to December or 12).
temps4 = temps3 %>% mutate(M = month(D2))
#looking at our output
head(temps4)
## Year M Temp_Anomaly D D2
## 1 1880 1 -0.79 1880.Jan.1 1880-01-01
## 2 1881 1 -0.76 1881.Jan.1 1881-01-01
## 3 1882 1 0.15 1882.Jan.1 1882-01-01
## 4 1883 1 -0.65 1883.Jan.1 1883-01-01
## 5 1884 1 -0.56 1884.Jan.1 1884-01-01
## 6 1885 1 -0.88 1885.Jan.1 1885-01-01
Now we can get to plotting!
ggplot(temps4, aes(Temp_Anomaly)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) + geom_point()
#With trendline
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) + geom_point() + geom_smooth(method = "lm") + labs(x = "Year", y = expression("Global Average Monthly \n Temperature Anomaly " ( degree~C)))
#color continuous
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) + geom_point(aes(color = M)) + labs(x = "Year", y = expression("Global Average Monthly \n Temperature Anomaly " ( degree~C)), color = "Month")
#color discrete
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) + geom_point(aes(color = as.factor(M))) + labs(x = "Year", y = expression("Global Average Monthly \n Temperature Anomaly " ( degree~C)), color = "Month")
#Facetting
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) + geom_point() + facet_wrap(~M) + labs(x = "Year", y = expression("Global Average Monthly \n Temperature Anomaly " ( degree~C)))
#grouping data by year, then applying mean function to summarize.
temps_year = temps4 %>%
group_by(Year) %>%
summarise(mean_yearly_anom = mean(Temp_Anomaly),
sd_yearly_anom = sd(Temp_Anomaly))
#checking output
head(temps_year)
## # A tibble: 6 × 3
## Year mean_yearly_anom sd_yearly_anom
## <int> <dbl> <dbl>
## 1 1880 -0.4591667 0.2166882
## 2 1881 -0.4191667 0.2999230
## 3 1882 -0.3300000 0.3325658
## 4 1883 -0.3491667 0.3582142
## 5 1884 -0.6208333 0.3693596
## 6 1885 -0.4800000 0.3667300
ggplot(temps_year, aes(x = Year, y = mean_yearly_anom)) +
geom_point() +
geom_errorbar(aes(ymin = mean_yearly_anom - sd_yearly_anom,
ymax = mean_yearly_anom + sd_yearly_anom)) +
labs(x = "Year",
y = expression("Global Average Annual \n Temperature Anomaly " ( degree~C)))
You can manipulate this markdown document to complete tasks 1-5. Make sure to:
We have visualized the relationship between temperature and time - now we want to statistically test the trends we see.
Let’s say we want to test whether or not there is a relationship between temperature and time and have the hypothesis that temperature has been increasing over time. Once we have a research question and hypothesis, we can develop statistical null and alternative hypotheses. For instance:
We can evaluate our null hypothesis through the use of linear regression (parametric). For our purposes: parametric tests assume a normal distribution and homoscedasticity, while non-parametric tests do not make those assumptions.
Both of these tests have assumptions that we need to test before applying the tests. Assumptions are criteria that the data need to meet to ensure that the test allows appropriate analysis of the data.
These are nicely explained here : http://www.biostathandbook.com/linearregression.html
Note: simulations show that regression (and correlation) are fairly robust to non-normality and heteroscedasticity (unevenly distributed variation). This means that even if the data are not normal, the p-value is still valid (i.e., it will fall below 0.05 only about 5% of the time if the null hypothesis is true - the type I error rate holds up).
Before we rigorously test assumptions, what evidence do we already have about the normality or non-normality of our dataset? Do we have information about any of the other assumptions?
Using information from http://www.simonqueenborough.info/R/statistics/testing-assumptions, we will test assumptions for the parametric linear regression test.
First we will fit a linear regression to our data, using the lm() command:
temp_model = lm(temps4$Temp_Anomaly ~ temps4$Year)
# read as lm(y ~ x): we regress Temp_Anomaly on Year
# note that all twelve monthly points within a year share that year's x value
If we look at the structure of the model - str(temp_model) - we see that it contains a variety of components, including the residuals and the fitted model values (i.e., the predicted values of the line of best fit). We can extract these components of the model to test assumptions.
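For example, here are a few ways to pull out pieces of an lm object, sketched on a toy fit to the built-in cars dataset (the component names are the same for any lm fit, including temp_model):

```r
# Toy fit with the same structure as temp_model.
m <- lm(dist ~ speed, data = cars)

names(m)              # includes "coefficients", "residuals", "fitted.values", among others
head(m$residuals)     # observed minus fitted, one per data point
head(m$fitted.values) # predicted values along the line of best fit
coef(m)               # intercept and slope (accessor form of m$coefficients)
```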
1. An assumption of a linear relationship
To do this, we will make a plot of observed (from original dataset) vs. predicted (by model) values
# Here, the observed values (temps4$Temp_Anomaly) are plotted against the predicted
# (or fitted: temp_model$fitted.values) values. The abline command adds a 1:1 line.
plot(temps4$Temp_Anomaly ~ temp_model$fitted.values)
abline(0, 1)
What can we conclude about the assumption?
We can also plot the residual values vs. predicted values. If we have a linear relationship, residual values should be ~symmetrically distributed around a horizontal line, with relatively similar variance across predicted values.
# Here, the residuals (temp_model$residuals) are plotted against the predicted
# (or fitted: temp_model$fitted.values) values. The abline command adds a horizontal line at y = 0.
plot(temp_model$residuals ~ temp_model$fitted.values)
abline(h = 0)
With this graph, what additional information do we have about the assumption?
We can also plot residuals vs. the independent variable in our dataset (time). Again, residuals should be evenly distributed with similar variance across the independent variable.
plot(temp_model$residuals ~ temps4$Year)
2. Homoscedasticity (similar variance across measured values)
These last two graphs also let us evaluate homoscedasticity. Data that show homoscedasticity will have residuals that are evenly distributed across the data. Do our data show this?
We can statistically test for homoscedasticity using Bartlett’s test. The null hypothesis of Bartlett’s test is that the variances across samples are the same.
bartlett.test(Temp_Anomaly ~ Year, data=temps4)
##
## Bartlett test of homogeneity of variances
##
## data: Temp_Anomaly by Year
## Bartlett's K-squared = 403.03, df = 137, p-value < 2.2e-16
What can we conclude?
3. Normality of data + residuals
We can look at the distribution of the data via a histogram:
hist(temps4$Temp_Anomaly)
We can plot our temperature anomaly values against the quantiles of a normal distribution (a Q-Q plot). If the sample values fall along a straight 1:1 line, we can assume normality:
qqnorm(temps4$Temp_Anomaly)
Finally, we can statistically test the assumption of normality using the Shapiro-Wilk test, which has the null hypothesis that the population is normally distributed.
shapiro.test(temps4$Temp_Anomaly)
##
## Shapiro-Wilk normality test
##
## data: temps4$Temp_Anomaly
## W = 0.97248, p-value < 2.2e-16
4. Independence of the y values and of the residuals associated with them
The acf() function computes estimates of autocorrelation. The x axis corresponds to the lag of the residuals, increasing in steps of 1. The very first line (leftmost) shows the correlation of the residuals with themselves (lag 0), so it will always be equal to 1.
If the residuals were not autocorrelated, the correlations (y axis) from the next line onwards would drop to near zero, below the dashed blue line (the significance level). Values above the blue line indicate autocorrelations that are statistically significantly different from zero.
acf(temp_model$residuals)
Although our data do violate some of the assumptions of linear regression, we will continue to evaluate our statistical hypothesis using them. This decision is supported by the relative robustness of linear regression to non-normality and heteroscedasticity, as well as its use in analyses by the Intergovernmental Panel on Climate Change (IPCC) (https://www.ipcc.ch/pdf/assessment-report/ar5/wg1/WG1AR5_Chapter02_FINAL.pdf, see pg. 22), its recommendation by the National Center for Atmospheric Research (NCAR) (https://climatedataguide.ucar.edu/climate-data-tools-and-analysis/trend-analysis), and its use in other scientific publications looking at temperature over time (https://journals.ametsoc.org/doi/full/10.1175/JCLI-D-15-0032.1).
To see the output of our linear regression, we use:
summary(temp_model)
##
## Call:
## lm(formula = temps4$Temp_Anomaly ~ temps4$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92111 -0.16804 -0.00664 0.15727 1.07587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.887e+01 2.889e-01 -65.32 <2e-16 ***
## temps4$Year 9.698e-03 1.482e-04 65.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2403 on 1654 degrees of freedom
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7212
## F-statistic: 4281 on 1 and 1654 DF, p-value: < 2.2e-16
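The slope reported above can be converted into a more intuitive warming rate. A small sketch, using the slope value from the summary (in a live session, extract it with coef(temp_model)[["temps4$Year"]] instead of hard-coding it):

```r
# Slope from the summary above: change in anomaly (degrees C) per year.
slope <- 9.698e-03

# Scaled to warming per decade and per century.
slope * 10   # ~0.097 degrees C per decade
slope * 100  # ~0.97 degrees C per century
```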
We can check the fit of our linear regression - using method = "lm" in the code below applies the lm model to the x and y specified in the main aes() expression:
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) +
geom_point() + geom_smooth(method = "lm") +
labs(x = "Year", y = expression("Global Average Monthly \n Temperature Anomaly " ( degree~C)))
We could assess our data non-parametrically via Kendall-Theil regression - a nonparametric analogue of linear regression.
library(mblm) #median-based linear models
model.temp = mblm(Temp_Anomaly ~ Year,
data=temps4)
summary(model.temp)
##
## Call:
## mblm(formula = Temp_Anomaly ~ Year, dataframe = temps4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90954 -0.11059 0.05448 0.22099 1.09043
##
## Coefficients:
## Estimate MAD V value Pr(>|V|)
## (Intercept) -17.142394 5.890230 832 <2e-16 ***
## Year 0.008781 0.003006 1369500 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2509 on 1654 degrees of freedom
How does this nonparametric test compare to our parametric result?
We can also look at non-linear trends in our data. The loess fit is a locally weighted least squares regression, meaning that it weights nearby data points more heavily when determining each predicted y value.
ggplot(temps4, aes(x = D2, y = Temp_Anomaly)) + geom_point() + geom_smooth(method = "loess")
The loess suggests that a polynomial fit may be better. We can fit a polynomial via:
#poly fits a polynomial to the data, with the degree specified after the comma (3).
temp_model_poly = lm(temps4$Temp_Anomaly ~ poly(temps4$Year, 3))
#looking at the summary of the model
summary(temp_model_poly)
##
## Call:
## lm(formula = temps4$Temp_Anomaly ~ poly(temps4$Year, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02129 -0.13455 0.00668 0.13522 0.98904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.029354 0.004985 5.888 4.72e-09 ***
## poly(temps4$Year, 3)1 15.721397 0.202866 77.496 < 2e-16 ***
## poly(temps4$Year, 3)2 4.510878 0.202866 22.236 < 2e-16 ***
## poly(temps4$Year, 3)3 2.673800 0.202866 13.180 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2029 on 1652 degrees of freedom
## Multiple R-squared: 0.8016, Adjusted R-squared: 0.8012
## F-statistic: 2225 on 3 and 1652 DF, p-value: < 2.2e-16
#Are our residuals more normal?
plot(temp_model_poly$fitted.values,temp_model_poly$residuals)
#Graphing our polynomial fit
ggplot(temps4, aes(x = Year, y = Temp_Anomaly)) + geom_point() + stat_smooth(formula = y ~ poly(x, 3), se=TRUE, method="lm")
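Since the straight-line model is nested within the cubic one, an F-test via anova() can tell us whether the extra polynomial terms significantly improve the fit. A minimal sketch on the built-in cars dataset (for our data, compare temp_model and temp_model_poly in the same way):

```r
# Nested models: straight line vs. cubic polynomial.
m1 <- lm(dist ~ speed, data = cars)
m3 <- lm(dist ~ poly(speed, 3), data = cars)

# F-test on the two additional polynomial terms; a small p-value
# indicates the cubic fit explains significantly more variance.
anova(m1, m3)
```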
What if we wanted to evaluate whether warming is occurring more rapidly over time? Come up with a way to test this question and carry out the analysis. Once done, evaluate the question using the evidence you have collected.